Multimedia integration description scheme, method and system for MPEG-7

ABSTRACT

The invention provides a system and method for integrating multimedia descriptions in a way that allows humans, software components or devices to easily identify, represent, manage, retrieve, and categorize the multimedia content. In this manner, a user who may be interested in locating a specific piece of multimedia content from a database, Internet, or broadcast media, for example, may search for and find the multimedia content. In this regard, the invention provides a system and method that receives multimedia content and separates the multimedia content into separate components which are assigned to multimedia categories, such as image, video, audio, synthetic and text. Within each of the multimedia categories, the multimedia content is classified and descriptions of the multimedia content are generated. The descriptions are then formatted, integrated, using a multimedia integration description scheme, and the multimedia integration description is generated for the multimedia content. The multimedia description is then stored into a database. As a result, a user may query a search engine which then retrieves the multimedia content from the database whose integration description matches the query criteria specified by the user. The search engine can then provide the user a useful search result based on the multimedia integration description.

The present application is a continuation of U.S. patent application Ser. No. 13/169,330, filed Jun. 27, 2011, which is a continuation of U.S. patent application Ser. No. 12/372,052, filed Feb. 17, 2009, now U.S. Pat. No. 7,970,822, issued Jun. 28, 2011, which is a continuation of U.S. patent application Ser. No. 11/278,671, filed Apr. 4, 2006, now U.S. Pat. No. 7,506,024, issued Mar. 17, 2009, which is a continuation of application Ser. No. 09/495,175, filed on Feb. 1, 2000, now U.S. Pat. No. 7,185,049, issued Feb. 27, 2007, which claims the benefit of U.S. Provisional Application No. 60/118,022, filed Feb. 1, 1999, the contents of which is incorporated herein by reference in its entirety.

This application is also related to U.S. patent application Ser. No. 11/616,315, filed Dec. 27, 2006, now U.S. Pat. No. 7,599,965, issued Oct. 6, 2009, the contents of which is incorporated herein by reference in its entirety.

This application includes an Appendix containing computer code that performs content description in accordance with the exemplary embodiment of the present invention. That Appendix of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

1. Field of Invention

The present invention generally relates to audiovisual data representation. More particularly, this invention relates to integrating the descriptions of multiple categories of audiovisual content to allow such content to be searched or browsed with ease in digital libraries, Internet web sites and broadcast media, for example.

2. Description of Related Art

More and more audiovisual information is becoming available from many sources around the world. Such information may be represented by various forms of media, such as still pictures, video, graphics, 3D models, audio and speech. In general, audiovisual information plays an important role in our society, be it recorded in such media as film or magnetic tape or originating, in real time, from some audio or visual sensors, be it analogue or, increasingly, digital.

While audio and visual information used to be consumed directly by the human being, computational systems are increasingly creating, exchanging, retrieving and re-processing this audiovisual information. Such is the case for image understanding, e.g., surveillance, intelligent vision, smart cameras, etc., media conversion, e.g., speech to text, picture to speech, speech to picture, etc., information retrieval, e.g., quickly and efficiently searching for various types of multimedia documents of interest to the user, and filtering to receive only those multimedia data items which satisfy the user's preferences in a stream of audiovisual content.

For example, a code in a television program triggers a suitably programmed VCR to record that program, or an image sensor triggers an alarm when a certain visual event happens. Automatic transcoding may be performed based on a string of characters or audible information or a search may be performed in a stream of audio or video data. In all these examples, the audiovisual information has been suitably “encoded” to enable a device or a computer code to take some action.

In the infancy of web-based information communication and access systems, information is routinely transferred, searched, retrieved and processed. Presently, much of the information is predominantly represented in text form. This text-based information is accessed using text-based search algorithms.

However, as web-based systems and multimedia technology continue to improve, more and more information is becoming available in a form other than text, for instance as images, graphics, speech, animation, video, audio and movies. As the volume of such information is increasing at a rapid rate, it is becoming important to be able to easily search for and retrieve a specific piece of information of interest. It is often difficult to search for such information by text-only search. Thus the increased presence of multimedia information and the need to be able to find the required portions of it in an easy and reliable manner, irrespective of the search engines employed, has spurred on the drive for a standard for accessing such information.

The Moving Picture Experts Group (MPEG) is a working group under the International Organization for Standardization/International Electrotechnical Commission in charge of the development of international standards for compression, decompression, processing and coded representation of video data, audio data and their combination.

MPEG previously developed the MPEG-1, MPEG-2 and MPEG-4 standards, and is presently developing the MPEG-7 standard, formally called “Multimedia Content Description Interface”, hereby incorporated by reference in its entirety.

MPEG-7 will be a content representation standard for multimedia information search and will include techniques for describing individual media content and their combination. Thus, the MPEG-7 standard aims to provide a set of standardized tools to describe multimedia content. Therefore, the MPEG-7 standard, unlike the MPEG-1, MPEG-2 or MPEG-4 standards, is not a media content coding or compression standard but rather a standard for representation of descriptions of media content. The data representing descriptions is called “metadata”. Thus, irrespective of how the media content is represented, i.e., analogue, PCM, MPEG-1, MPEG-2, MPEG-4, Quicktime, Windows Media, etc., the metadata associated with this content may, in the future, be MPEG-7.

Often, the value of multimedia information depends on how easily it can be found, retrieved, accessed, filtered and managed. In spite of the fact that users have increasing access to this audiovisual information, searching, identifying and managing it efficiently is becoming more difficult because of the sheer volume of the information. Moreover, the question of identifying and managing multimedia content is not just restricted to database retrieval applications such as digital libraries, but extends to areas such as broadcast channel selection, multimedia editing and multimedia directory services.

Although techniques for tagging audiovisual information allow some limited access and processing based on text-based search engines, the amount of information that may be included in such tags is somewhat limited. For example, for movie videos, the tag may reflect the name of the movie or a list of actors, etc., but this information may apply to the entire movie and may not be sub-divided to indicate the content of individual shots and objects in such shots. Moreover, the amount of information that may be included in such tags and the architecture for searching and processing that information are severely limited.

SUMMARY OF THE INVENTION

The invention provides a system and method for integrating multimedia descriptions in a way that allows humans, software components or devices to easily identify, manage, manipulate, and categorize the multimedia content. In this manner, a user who may be interested in locating a specific piece of multimedia content from a database, Internet, or broadcast media, for example, may search for and find the multimedia content.

In this regard, the invention provides a system and method that receives multimedia content from a multimedia stream and separates the multimedia content into separate components which are assigned to single media categories, such as image, video, audio, synthetic audiovisual, and text. Within each of the single media categories, media events are classified and descriptions of such single media events are generated. These descriptions are then integrated and formatted according to a multimedia integration description scheme. A multimedia integration description is then generated for the multimedia content. The multimedia description is then stored into a database.

As a result, a user may query a search engine which then retrieves the multimedia integration description from the database. The search engine can then provide the user a useful search result whose multimedia integration description meets the query requirements.

The exemplary embodiment of the invention addresses the draft requirements of MPEG-7 promulgated by MPEG at the time of the filing of this patent application. That is, the invention provides object-oriented, generic abstraction and uses objects and events as fundamental entities for description. Thus, the invention provides an efficient framework for description of various types of multimedia data.

The invention is also a comprehensive tool for describing multimedia data because it uses eXtensible Markup Language (XML), which is self-describing. The present invention also provides flexibility because parts can be instantiated so as to provide efficient organization. The invention also provides extensibility and the ability to define relationships between data because elements defined in the description scheme can be used to derive new elements.

These and other features and advantages of this invention are described in or are apparent from the following detailed description of the system and method according to this invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The preferred embodiments of the invention will be described in detail with reference to the following figures wherein:

FIG. 1A is an exemplary block diagram showing a multimedia integration system;

FIG. 1B is an exemplary block diagram of an exemplary individual media type descriptor generation unit shown in FIG. 1A;

FIG. 2 is an exemplary block diagram of the multimedia integration description scheme unit in FIG. 1A;

FIG. 3 is an example of a multimedia stream, consisting of multimedia objects and single-media objects, and the relationship among these objects;

FIG. 4 is a UML representation of the multimedia description scheme at the multimedia stream level which consists of one or more multimedia objects;

FIG. 5 is a UML representation of the multimedia description scheme at the multimedia object level; and

FIG. 6 is an exemplary flowchart showing the process of generating integrated multimedia content description for multimedia content.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Prior to explaining the exemplary embodiment of the invention, a synopsis of MPEG-7 is provided to aid in the reader's understanding of how the exemplary embodiment processes multimedia data within the construct of MPEG-7.

MPEG-7 is the result of a global demand that has logically followed the increasing availability of digital audiovisual content. Audiovisual information, both natural and synthetic, will continue to be increasingly available from many sources around the world. Also, users want to use this audiovisual information for various purposes. However, before the information can be used, it must be identified, located, indexed, and even characterized properly. At the same time, the increasing availability of potentially interesting material makes searching more difficult because of the increasingly voluminous pool of information to be searched.

MPEG-7 is directed at standardizing the interface for describing multimedia content to allow efficient searching and retrieval for various types of multimedia material interesting to the user. MPEG-7 is meant to provide standardization of multimedia content descriptions. MPEG-7 expects to extend the limited capabilities of proprietary solutions in identifying content that exist today, notably by including more data types. In other words, MPEG-7 will specify a standard set of descriptors that can be used to describe various types of multimedia information. MPEG-7 will also specify predefined structures of descriptors and their relationships, as well as ways to define one's own structures. These structures are called description schemes (DSs). Defining new description schemes can be performed using a special language, the description definition language (or DDL), which is also a part of the MPEG-7 standard. The description, i.e., a set of instantiated description schemes, is associated with the content itself to allow fast and efficient searching for material of a user's interest. MPEG-7 will also include coded representations of a description for efficient storage, or fast access.

Conventionally, search engines each have individual syntax formats that differ. These differences in syntax format cause compatibility issues between search criteria, e.g., identical criteria used by different engines yield different results. With the use of description schemes under MPEG-7, these search engines will be able to process MPEG-7 multimedia contents regardless of the differing syntax formats to produce the same results.

The requirements of MPEG-7 apply, in principle, to both real time and non-real time applications. Also, MPEG-7 will apply to push and pull applications. However, MPEG-7 will not standardize or evaluate applications. Rather, MPEG-7 will interact with many different applications in many different environments, which means that it will need to provide a flexible and extensible framework for describing multimedia data.

Therefore, MPEG-7 will not define a monolithic system for content description. Rather, MPEG-7 will define a set of methods and tools for describing multimedia data. Thus, MPEG-7 expects to standardize a set of descriptors, a set of description schemes, a language to specify description schemes (and possibly descriptors), e.g., the description definition language, and one or more ways to encode descriptions. A starting point for the description definition language is XML, although it is expected that basic XML will eventually need to be customized and modified for use in MPEG-7.

The exemplary embodiment of the invention described herein with reference to FIGS. 1A-6 conforms to the requirements of the MPEG-7 standard, in its present form.

The following description of the particular embodiment of the invention uses terminology that is consistent with definitions provided in the MPEG-7 standard. The term “data” indicates audiovisual information that is described using MPEG-7, regardless of storage, coding, display, transmission, medium or technology. Data encompasses, for example, graphics, still images, video, film, music, speech, sounds, text and any other relevant audiovisual medium. Examples of such data may be found in, for example, an MPEG-4 stream, a video tape, a compact disc containing music, sound or speech, a picture printed on paper or an interactive multimedia installation on the web.

A “feature” indicates a distinctive characteristic of the data which signifies something to someone. Examples of features include image color, speech pitch, audio segment rhythm, video camera motion, video style, movie title, actors' names in a movie, etc. Examples of features of visual objects include shape, surface, complexity, motion, light, color, texture, shininess and transparency.

A “descriptor” is a representation of a feature. It is possible to have several descriptors representing a single feature. A descriptor defines the syntax and the semantics of the feature representation and allows the evaluation of the corresponding feature via the descriptor value. Examples of such descriptors include color histogram, frequency component average, motion field, title text, etc.

A “descriptor value” is an instantiation of a descriptor for a given data set. Descriptor values are combined using a description scheme to form a description.

A “description scheme” specifies the structure and semantics of relationships between its components, which may be both descriptors and description schemes. The distinction between a description scheme and a descriptor is that a descriptor contains only basic data types, as provided by the description definition language. A descriptor also does not refer to another descriptor or description scheme.

A “description” is the result of instantiating a description scheme. To instantiate a description scheme, a set of descriptor values that describe the data is structured according to a description scheme. Depending on the completeness of the set of descriptor values, the description scheme may be fully or partially instantiated. Additionally, it is possible that the description scheme may be merely incorporated by reference in the description rather than being actually present in the description. A “coded description” is a description that has been encoded to fulfill relevant requirements such as compression efficiency, error resilience, random access, etc.
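By way of illustration only, the relationship among a descriptor, a descriptor value, a description scheme and a description may be sketched in Python as follows. The class names, the `instantiate` method and the sample descriptor names are assumptions made for this sketch and are not taken from the Appendix or from the MPEG-7 standard.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical descriptor: a named representation of one feature."""
    name: str
    feature: str


@dataclass
class DescriptionScheme:
    """Hypothetical description scheme: the descriptors a description may hold."""
    name: str
    descriptors: List[Descriptor]

    def instantiate(self, values: Dict[str, Any]) -> "Description":
        # Reject values for descriptors the scheme does not declare.
        known = {d.name for d in self.descriptors}
        unknown = set(values) - known
        if unknown:
            raise ValueError(f"undeclared descriptors: {sorted(unknown)}")
        return Description(scheme=self, values=dict(values))


@dataclass
class Description:
    """Descriptor values structured according to a description scheme."""
    scheme: DescriptionScheme
    values: Dict[str, Any]

    def is_fully_instantiated(self) -> bool:
        # Fully instantiated only when every declared descriptor has a value.
        return all(d.name in self.values for d in self.scheme.descriptors)


scheme = DescriptionScheme(
    "image_ds",
    [Descriptor("color_histogram", "color"), Descriptor("title_text", "title")],
)
partial = scheme.instantiate({"title_text": "Sunset"})
full = scheme.instantiate(
    {"title_text": "Sunset", "color_histogram": [0.2, 0.5, 0.3]}
)
```

In this sketch, `partial` corresponds to a partially instantiated description scheme and `full` to a fully instantiated one.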

The “description definition language (DDL)” is the language that allows the creation of new description schemes and, possibly, new descriptors. The description definition language also allows the extension and modification of existing description schemes.

MPEG-7 data may be physically located with the associated audiovisual material, in the same data stream or on the same storage system, but the descriptions can also be stored elsewhere. When the content and its descriptions are not co-located, mechanisms that link audiovisual material and their MPEG-7 descriptions are needed; these links must work in both directions.

The exemplary embodiment meets the present MPEG-7 requirements outlined in the present draft of MPEG-7 standard requirements. Requirements include criteria relating to descriptors, description scheme requirements, the description definition language requirements and system requirements. While the exemplary embodiment of the invention should satisfy all requirements of MPEG-7 when taken as a whole, not all requirements have to be satisfied by each individual descriptor or description scheme.

The descriptor requirements include cross-modality, direct data manipulation, data adaptation, language of text-based descriptions, linking, prioritization of related information and unique identification. Description scheme requirements include description scheme relationships, descriptor prioritization, descriptor hierarchy, descriptor scalability, temporal range description, data adaptation, compositional capabilities, unique identification, primitive data types, composite data types, multiple media types, various types of description scheme instantiations, relationships within a description scheme and between description schemes, relationship between description and data, links to ontologies, platform independence, grammar, constraint validation, intellectual property management and protection, human readability and real time support.

While a description scheme can be generated using any description definition language, the exemplary embodiment of the invention uses eXtensible Markup Language (XML) to represent the integration description scheme. XML is a useful subset of SGML. XML is easier to learn, use and implement than SGML. XML allows for self description, i.e., allows description and structure of description in the same format and document. Use of XML also allows linking of collections of data by importing external document type definitions using description schemes.

Additionally, XML is highly modular and extensible. XML provides a self-describing and extensible mechanism, and although not media centric, can provide a reasonable starting basis. Another major advantage of using XML is that it allows the descriptions to be self-describing, in the sense that they combine the description and the structure of the description in the same format and document. XML also provides the capability to import external document type definitions (or DTDs), e.g., for feature descriptors, into the image description scheme document type definitions in a highly modular and extensible way.

According to the exemplary embodiment of the invention, each multimedia component description can include multimedia component objects. Each multimedia component object has one or more associated multimedia component features. The multimedia component features of an object are grouped together as being visual, audio or a relationship on semantic or media. In the multimedia component description scheme, each feature of an object has one or more associated descriptors.

The multimedia description scheme also includes specific document type definitions, also generated using the XML framework, to provide example descriptors. The document type definition provides a list of the elements, tags, attributes, and entities contained in the document, and their relationships to each other. Document type definitions specify a set of rules for the structure of a document. For example, a document type definition specifies the parameters needed for certain kinds of documents. Using the multimedia description scheme, document type definitions may be included in the file that contains the document they describe. In such a case, the document type definition is included in a document's prolog after the XML declaration and before the actual document data begins. Alternatively, the document type definition may be linked to the file from an external URL. Such external document type definitions can be shared by different documents and Web sites. In such a way, for a given descriptor, the multimedia description scheme can provide a link to external descriptor extraction code and descriptor similarity code.

Audiovisual material that has MPEG-7 data associated with it may include still pictures, graphics, 3D models, audio, speech, video, and information about how these elements are combined in a multimedia presentation, i.e., scenario composition information. A special case of this general data type may include facial expressions and personal characteristics.

FIG. 1A is a block diagram of an exemplary multimedia integration description system 100. The multimedia integration description system 100 includes global media description unit 110, local media description unit 120, integration descriptors unit 150, multimedia integration description scheme unit 160, and multimedia integration description generator 165.

The global media description unit 110 includes a global description generation unit 115 which receives multimedia content and provides global descriptions to the integration descriptors unit 150. The global descriptions provided to the integration descriptors unit 150 are descriptions that are relevant to the multimedia content as a whole, such as time, duration, space, etc. The local media description unit 120 includes description generation units for various categories of multimedia content including image description generation unit 125, video description generation unit 130, audio description generation unit 135, synthetic audiovisual description generation unit 140 and text description generation unit 145.

While FIG. 1A illustrates the relationship between the integration description scheme and five categories of single media descriptions, one skilled in the art may appreciate that these categories are exemplary and therefore may be subdivided or reclassified into a greater or lesser number of categories. In that regard, the exemplary embodiment illustrated in FIG. 1A illustrates how multimedia content can be divided and categorized into five description categories.

The multimedia integration description scheme unit 160 contains description schemes for integrating one or more of the description categories. In other words, the multimedia integration description scheme unit 160 maps how image, video, audio, synthetic and/or text descriptions, as descriptions of the component objects of a multimedia object, should be combined to form the description of the composite multimedia object and to be stored for easy retrieval. The multimedia integration description scheme unit 160 also provides input to the integration descriptors unit 150 so that it can provide proper descriptor values to the multimedia integration description generator 165.

The multimedia integration description generator 165 generates a multimedia integration description based on the one or more of the image, video, audio, synthetic and/or text descriptions received from the description generation units 125-145, the descriptor values received from the integration descriptors unit 150 and the multimedia integration description scheme received from the multimedia integration description scheme unit 160. The multimedia integration description generator 165 generates the multimedia integration description and stores the description in the database 170.

Once the multimedia integration description has been stored, a user terminal 180, for example, may request multimedia content from a search engine 175. The search engine 175 then retrieves the multimedia content descriptions, whose multimedia integration descriptions meet what the user requested, from the database 170 and provides the retrieved multimedia content descriptions to the user at terminal 180.
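As a non-limiting sketch, the retrieval step may be pictured as filtering stored descriptions against the criteria of a query. The in-memory list standing in for the database 170, the `search` function standing in for the search engine 175, and the field names `content_id`, `media` and `title` are all hypothetical and chosen only for this illustration; an actual search engine may match descriptions in many other ways.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for database 170: one stored description per content item.
database: List[Dict[str, Any]] = [
    {"content_id": "clip-001",
     "description": {"media": "video", "title": "Evening News"}},
    {"content_id": "clip-002",
     "description": {"media": "audio", "title": "Evening News"}},
    {"content_id": "clip-003",
     "description": {"media": "video", "title": "Weather"}},
]


def search(db: List[Dict[str, Any]], criteria: Dict[str, Any]) -> List[str]:
    """Return ids of items whose description matches every query criterion."""
    return [
        entry["content_id"]
        for entry in db
        if all(entry["description"].get(key) == value
               for key, value in criteria.items())
    ]


video_news = search(database, {"media": "video", "title": "Evening News"})
all_news = search(database, {"title": "Evening News"})
```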

FIG. 1B is an exemplary block diagram of one of the individual media type descriptor generation units (i.e., the image, video, audio, synthetic and text description generation units 125-145) shown in FIG. 1A. The individual media type description generation unit 121 includes a feature extractor and descriptors representation unit 122, an individual media type content description generator 123, and an individual media type description scheme unit 124.

The feature extractor and descriptors representation unit 122 receives individual media type content and extracts features from the content. The extracted features are represented by descriptor values which are output to the individual media type content description generator 123. The individual media type content description generator 123 uses the individual media type description scheme provided by the individual media type description scheme unit 124 and the descriptor values provided by the feature extractor and descriptors representation unit 122 to output the content description which is sent to the multimedia integration description generator 165, shown in FIG. 1A.
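The data flow of FIG. 1B may be sketched, purely for illustration, as two functions: one that turns content into descriptor values and one that arranges those values according to a description scheme. The two toy extractors, the representation of an image as a flat list of 8-bit intensities, and the representation of a scheme as an ordered list of descriptor names are assumptions of this sketch, not features of the disclosed units.

```python
from typing import Any, Callable, Dict, List


def mean_intensity(pixels: List[int]) -> float:
    """Toy descriptor value: average of 8-bit pixel intensities."""
    return sum(pixels) / len(pixels)


def intensity_histogram(pixels: List[int], bins: int = 4) -> List[int]:
    """Toy descriptor value: counts of pixels per equal-width intensity bin."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    return counts


def extract_descriptor_values(
    pixels: List[int], extractors: Dict[str, Callable[[List[int]], Any]]
) -> Dict[str, Any]:
    """Role loosely analogous to unit 122: content in, descriptor values out."""
    return {name: extract(pixels) for name, extract in extractors.items()}


def generate_content_description(
    scheme: List[str], values: Dict[str, Any]
) -> Dict[str, Any]:
    """Role loosely analogous to generator 123: order values per the scheme."""
    return {name: values[name] for name in scheme if name in values}


pixels = [0, 64, 128, 255]
values = extract_descriptor_values(
    pixels,
    {"mean_intensity": mean_intensity, "intensity_histogram": intensity_histogram},
)
description = generate_content_description(
    ["intensity_histogram", "mean_intensity"], values
)
```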

FIG. 2 shows the multimedia integration description scheme unit 160. The multimedia integration description scheme unit 160 interacts with a global media description scheme 215, an image description scheme 220, a video description scheme 230, an audio description scheme 240, a synthetic audiovisual description scheme 250 and a text description scheme 260, which provide description scheme inputs to the multimedia integration description scheme 210. The multimedia integration description scheme 210 also receives input from integration descriptors 205.

In this manner, the integration descriptors 205 and the description schemes 215-260 provide individual maps for each category of multimedia content. The description schemes 220-260 provide description schemes for individual multimedia categories which are integrated into a composite multimedia description scheme 210. The multimedia integration description scheme 210 is used by the multimedia integration description generator 165 to generate a multimedia integration description for the multimedia content which is then stored in the database 170 for future retrieval by search engine 175.

The multimedia integration description scheme (or MMDS) 210 describes multimedia content which may contain composing data from different media types, such as images, natural video, audio, synthetic video, and text. The multimedia integration description scheme 210 is configured to meet the requirements for multimedia integration description schemes specified by MPEG-7, for example, and is independent of any description definition language. The multimedia integration description scheme 210 is also configured to achieve the maximum synergy with the separate description schemes, such as the image, video, audio, synthetic, and text description schemes 220-260.

In the multimedia integration description scheme 210, a multimedia stream is represented as a set of relevant multimedia objects that can be further organized by using object hierarchies. Relationships among multiple multimedia objects that cannot be expressed using a tree structure are described using entity relation graphs. Multimedia objects can include multiple features, each of which can contain multiple descriptors. Each descriptor can link to external feature extraction and similarity matching code. Features are grouped according to the following categories: media features, semantic features, and temporal features.

At the same time, each multimedia object includes a set of single-media objects, which together form the multimedia object. Single-media objects are associated with features, hierarchies, entity relation graphs, and multiple abstraction levels, as described by single media description schemes (image, video, etc.). Multimedia objects are an association of multiple single-media objects, for example, a video object corresponding to a person, an audio object corresponding to his speech and the text object corresponding to the transcript.

The multimedia integration description scheme 210 includes the flexibility of the object-oriented framework which is also found in the individual media description schemes. The flexibility is achieved by (1) allowing parts of the description scheme to be instantiated; (2) using efficient categorization of features and clustering of objects (using the indexing hierarchy, for example); and (3) supporting efficient linking, embedding, or downloading of external feature descriptor and execution codes.

Elements defined in the multimedia integration description scheme 210 can be used to derive new elements for different domains. As mentioned earlier, it has been used in a description scheme for a specific domain (e.g., home media).

One unique aspect of the multimedia integration description scheme 210 is the capability to define multiple abstraction levels based on any arbitrary set of criteria. The criteria can be specified in terms of visual features (e.g., size), semantic relevance (e.g., relevance to user interest profile), or service quality (e.g., media features).

The multimedia integration description scheme 210 aims at describing multimedia content resulting from integration of multiple media streams. Examples of individual media streams are images, audio sequences, natural video sequences, synthetic video sequences, and text data. An example of such an integrated multimedia stream is a television program that includes video (both natural and synthetic), audio, and text streams.

Under the multimedia integration scheme, a multimedia stream is represented as a set of multimedia objects that include objects from the composing media streams. Multimedia objects are organized in object hierarchies or in entity relation graphs. Relationships among two or more multimedia objects that cannot be expressed in a tree structure can be described using multimedia entity relation graphs. The tree structures can be efficiently indexed and traversed, while the entity relation graphs can model general relationships.
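The distinction between the two organizing structures may be illustrated with a minimal Python sketch in which a hierarchy is a tree of nodes referencing object identifiers and an entity relation graph is a list of labeled edges. The class names, the edge representation and the sample object identifiers and relation names are hypothetical and serve only to make the distinction concrete.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class HierarchyNode:
    """One node of an object hierarchy; it references objects by id."""
    object_ids: List[str]
    children: List["HierarchyNode"] = field(default_factory=list)

    def walk(self) -> Iterator["HierarchyNode"]:
        """Depth-first, parent-before-child traversal of the tree."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class MultimediaStream:
    """A set of multimedia objects plus optional hierarchies and relations."""
    objects: Dict[str, str]  # object id -> human-readable label
    hierarchies: List[HierarchyNode] = field(default_factory=list)
    # Entity relation graph as (source id, relation name, target id) edges.
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def related_to(self, object_id: str) -> List[Tuple[str, str]]:
        """Return (relation, target) pairs for edges leaving object_id."""
        return [(rel, dst) for src, rel, dst in self.relations if src == object_id]


stream = MultimediaStream(
    objects={"mmo1": "news anchor", "mmo2": "weather map", "mmo3": "ticker text"},
    hierarchies=[
        HierarchyNode(["mmo1"], [HierarchyNode(["mmo2"]), HierarchyNode(["mmo3"])])
    ],
    relations=[("mmo1", "talks_about", "mmo2"), ("mmo3", "overlaps", "mmo2")],
)
visit_order = [
    oid for node in stream.hierarchies[0].walk() for oid in node.object_ids
]
```

Here the relation between `mmo3` and `mmo2` links two sibling nodes, which the tree alone cannot express, so it is carried by the edge list instead.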

The multimedia integration description scheme 210 builds on top of the individual media description schemes, including the image, video, audio, synthetic and text description schemes. All elements and structures used in the multimedia integration description scheme 210 are intuitive extensions of those used in individual media description schemes.

FIG. 3 is an example showing the basic elements and structures of the multimedia integration description scheme 210. The explanation will include example XML with the specific document type definition declarations included in Appendix A. A more complete listing of the XML description of the multimedia stream in FIG. 3 is included in Appendix B.

FIGS. 4 and 5 show the graphical representation of the proposed multimedia description scheme following the UML notations. FIGS. 4 and 5 clearly show the relationships of the multimedia description scheme 210 with description schemes 220-260 for individual media. It should be emphasized that the same structure is used at the multimedia object level and the single-media object level for the description schemes of the individual media: image, video, etc.

The multimedia stream element (<MM_Stream>) refers to the multimedia content being described. The multimedia stream is represented as one set of multimedia objects (<MM_Object_Set>), zero or more object hierarchies (<Object_Hierarchy>), and zero or more entity relation graphs (<Entity_Relation_Graph>). Each one of these elements is described in detail below.

The multimedia stream element can include a unique identifier attribute ID. Descriptions of archives containing multimedia streams will use these IDs to reference multimedia streams.

An example of the use of a multimedia stream element, expressed in XML, follows:

<!-- A multimedia Stream -->
<MM_Stream id="mmstream1">
  <!-- One multimedia object set -->
  <MM_Object_Set> </MM_Object_Set>
  <!-- Multiple object hierarchies -->
  <Object_Hierarchy> </Object_Hierarchy>
  ...
  <!-- Multiple entity relation graphs -->
  <Entity_Relation_Graph> </Entity_Relation_Graph>
  ...
</MM_Stream>
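As a minimal sketch of how a consumer might check the top-level structure just described (exactly one multimedia object set, any number of hierarchies and graphs), the following Python uses the standard xml.etree.ElementTree module. The function name summarize_stream and the sample document are illustrative assumptions and are not part of the description scheme; the "..." placeholders of the listing are replaced by empty elements so that the sample parses.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the listing above; empty elements stand in
# for the elided content.
STREAM_XML = """
<MM_Stream id="mmstream1">
  <MM_Object_Set/>
  <Object_Hierarchy/>
  <Object_Hierarchy/>
  <Entity_Relation_Graph/>
</MM_Stream>
"""


def summarize_stream(xml_text):
    """Return the stream id and child counts, requiring one object set."""
    root = ET.fromstring(xml_text)
    if root.tag != "MM_Stream":
        raise ValueError("root element must be MM_Stream")
    object_sets = root.findall("MM_Object_Set")
    if len(object_sets) != 1:
        raise ValueError("expected exactly one MM_Object_Set")
    return {
        "id": root.get("id"),
        "hierarchies": len(root.findall("Object_Hierarchy")),
        "graphs": len(root.findall("Entity_Relation_Graph")),
    }


summary = summarize_stream(STREAM_XML)
print(summary)
```

A stream description without an object set is rejected with a ValueError in this sketch; a real validator would more likely rely on the document type definitions of Appendix A.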

The basic description element of the multimedia description scheme is the multimedia object element (<MM_Object>). The set of all the multimedia objects in a multimedia stream is included within the multimedia object set (<MM_Object_Set>).

A multimedia object element includes a collection of single-media objects from one or more media streams that together form a relevant entity for searching, filtering, or presentation. These single-media objects may be from the same media stream or different media streams. Usually, single-media objects are from different media (e.g., audio, video, and text). These elements are defined in the description scheme of the corresponding media and may have associated hierarchies or entity relation graphs. For purposes of discussion, the definition of multimedia object also allows single-media objects of the same media type to be used. The single-media objects in a multimedia object do not need to be synchronized in time. In the following, "single-media object" and "media object" are used interchangeably.

The composing media objects (<Image_Object>, <Video_Object>, etc.) inside a multimedia object are included in a media object set element (<Media_Object_Set>). A multimedia object element can also include zero or more object hierarchy elements (<Object_Hierarchy>) and entity relation graph elements (<Entity_Relation_Graph>) to describe spatial, temporal, and/or semantic relationships among the composing media objects.

Each multimedia object can have multiple associated features and corresponding feature descriptors, as discussed above. A multimedia object can include semantic information (e.g., annotations), temporal information (e.g., duration), and media specific information (e.g., compression format).

Two types of multimedia objects are used: local and global objects. A global multimedia object element represents the entire multimedia stream. On the other hand, a local multimedia object has a limited scope within the multimedia stream, for example, an arbitrarily shaped video object representing a person and a segmented audio object corresponding to his speech. To differentiate between local and global multimedia objects, the multimedia object element includes a required attribute type, whose value can be LOCAL or GLOBAL. Only one multimedia object with a GLOBAL type will be included in a multimedia stream description.

Each multimedia object element can also have four optional attributes: ID, Object_Ref, Object_Node_Ref, and Entity_Node_Ref. ID (or id) is a unique identifier of the multimedia object within the multimedia stream description. When the multimedia object acts as placeholder (i.e., no feature descriptors included) of another multimedia object, Object_Ref references that multimedia object. Object_Node_Ref and Entity_Node_Ref include the lists of the identifiers of object node elements (<Object_Node>) and entity node elements (<Entity_Node>) that reference the multimedia object element, respectively.

Below, an example showing how these elements will be used in XML is included. FIG. 3 includes examples of global multimedia objects (mmog0), local multimedia objects (mmol1, mmol2, etc.), and media objects (aolg0, vog0, o0, etc.). See Appendix B for the XML of the example in FIG. 3.

<!-- A multimedia object set element -->
<MM_Object_Set>
  <!-- One or more multimedia objects -->
  <MM_Object type="GLOBAL" id="MMobjt1" Object_Node_Ref="ON1" Entity_Node_Ref="EN1"...>
    <!-- A media object set -->
    <Media_Object_Set> </Media_Object_Set>
    <!-- Multiple object hierarchies -->
    <Object_Hierarchy> </Object_Hierarchy>
    ...
    <!-- Multiple entity relation graphs -->
    <Entity_Relation_Graph> </Entity_Relation_Graph>
    ...
    <!-- Zero or one multimedia object media feature element -->
    <MM_Obj_Media_Features> </MM_Obj_Media_Features>
    <!-- Zero or one multimedia object semantic feature element -->
    <MM_Obj_Semantic_Features> </MM_Obj_Semantic_Features>
    <!-- Zero or one multimedia object temporal feature element -->
    <MM_Obj_Temporal_Features> </MM_Obj_Temporal_Features>
  </MM_Object>
  <MM_Object type="LOCAL" id="Mmobj2" Object_Node_Ref="ON2 ON10" Entity_Node_Ref="EN2 EN4 EN7"...>
    <!-- Content of MM Object -->
  </MM_Object>
  ...
</MM_Object_Set>
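The attribute rules stated above (a required type of LOCAL or GLOBAL, no more than one GLOBAL object, and space-separated lists in Object_Node_Ref and Entity_Node_Ref) can be sketched in Python as follows. This is only an illustration: the function name load_objects is hypothetical, the sample keeps the identifiers from the listing but drops the elided "..." parts, and it assumes every object carries an id even though the scheme makes that attribute optional.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the listing above; child elements are omitted.
OBJECT_SET_XML = """
<MM_Object_Set>
  <MM_Object type="GLOBAL" id="MMobjt1"
             Object_Node_Ref="ON1" Entity_Node_Ref="EN1"/>
  <MM_Object type="LOCAL" id="Mmobj2"
             Object_Node_Ref="ON2 ON10" Entity_Node_Ref="EN2 EN4 EN7"/>
</MM_Object_Set>
"""


def load_objects(xml_text):
    """Map object id -> type and back-reference lists, checking the rules."""
    root = ET.fromstring(xml_text)
    objects = {}
    global_count = 0
    for elem in root.findall("MM_Object"):
        obj_type = elem.get("type")
        if obj_type not in ("LOCAL", "GLOBAL"):
            raise ValueError("type must be LOCAL or GLOBAL")
        if obj_type == "GLOBAL":
            global_count += 1
        # Assumption of this sketch: the optional id attribute is present.
        objects[elem.get("id")] = {
            "type": obj_type,
            "object_nodes": elem.get("Object_Node_Ref", "").split(),
            "entity_nodes": elem.get("Entity_Node_Ref", "").split(),
        }
    if global_count > 1:
        raise ValueError("at most one GLOBAL multimedia object is allowed")
    return objects


objects = load_objects(OBJECT_SET_XML)
print(objects["Mmobj2"]["entity_nodes"])
```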

The set of all the single-media object elements (<Video_Object>, <Image_Object>, etc.) composing a multimedia object are included in the media object set element (<Media_Object_Set>). Each media object refers to one type of media, such as image, video, audio, synthetic video, and text. These media objects are defined in the description scheme of the corresponding type of media. It is important to point out that the single-media objects and the multimedia objects share very similar structures at multiple levels although the features are different.

In the same fashion as the multimedia objects, relationships among single-media objects can be described using media object hierarchies or entity relation graphs. Although entity relation graphs may lack the retrieval and traversal efficiency of hierarchical structures, they are used when efficient hierarchical tree structures are not adequate to describe specific relationships. Note that media object hierarchies may include media objects of different media types.

In the multimedia integration description scheme 210, each multimedia object can contain three multimedia object feature elements that group features based on the information they convey: media (<MM_Obj_Media_Features>), semantic (<MM_Obj_Semantic_Features>), and temporal (<MM_Obj_Temporal_Features>) information. Note that features associated with spatial positions of individual media objects (e.g., positions of video objects) are included in each specific single-media object. Spatial relationships among single-media objects inside a multimedia object are described using the entity relation graphs. Table 1 includes examples of features for each feature type:

TABLE 1 Examples of feature classes and features.

Feature Class   Features
Media           Data Location, Scalable Representation, Modality Transcoding¹
Semantic        Text Annotation¹, Who¹, What Object¹, What Action¹, Why¹, When¹, Where¹, Keywords
Temporal        Duration

¹Defined in Image description scheme.

Each multimedia object feature element includes corresponding descriptor elements. Specific descriptors can include links to external extraction and similarity matching code. External document type definitions for descriptors can be imported and used in the current multimedia description scheme. In this framework, new features, types of features, and descriptors can be included in an extensible and modular way.

In Appendix A, the declarations of the following example features are included: Data_Location, Scalable_Representation, Text_Annotation, Keywords, and Duration.

Each object hierarchy element (<Object_Hierarchy>) includes one object node element (<Object_Node>). An object hierarchy element can include a unique identifier as an attribute ID, for referencing purposes. It can also include an attribute type, to describe the type of binding (e.g., semantic) expressed by the hierarchy.

At the same time, an object node element (<Object_Node>) includes zero or more object node elements forming a tree structure. Each multimedia object node references a multimedia object in the multimedia object set through an attribute, Object_Ref, by using the latter's unique identifier. Each object node element can also include a unique identifier in the form of an attribute ID. By including the object nodes' unique identifiers in their Object_Node_Ref attributes, multimedia objects can point back to object nodes referencing them. For efficient traversal of the multimedia description, this mechanism is provided to traverse from multimedia objects in the multimedia object set to corresponding object nodes in the object hierarchy and vice versa.

The hierarchy is a way to organize the multimedia objects in the multimedia object set. The multimedia objects in a multimedia stream can be organized based on different criteria: the temporal relationships, the semantic relationships, and the value of one or more features. The top object of each multimedia object hierarchy can specify the criteria followed to generate the hierarchy. An example multimedia object hierarchy is detailed in XML, below:

<!-- Object hierarchy element -->
<Object_Hierarchy id=" ">
  <!-- One object node -->
  <Object_Node id=" " Object_Ref=" ... ">
    <!-- Multiple object nodes -->
    <Object_Node id=" ... " Object_Ref=" ... "> ... </Object_Node>
    ...
  </Object_Node>
  ...
</Object_Hierarchy>
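The two-way referencing described above, from object nodes to multimedia objects through Object_Ref and back through Object_Node_Ref, can be illustrated with a short Python sketch. The object identifiers mmog0, mmol1 and mmol2 are those named for FIG. 3; the node identifiers, the hierarchy id, the type value SEMANTIC and both function names are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchy: a global object with two local objects beneath it.
HIERARCHY_XML = """
<Object_Hierarchy id="OH1" type="SEMANTIC">
  <Object_Node id="ON1" Object_Ref="mmog0">
    <Object_Node id="ON2" Object_Ref="mmol1"/>
    <Object_Node id="ON3" Object_Ref="mmol2"/>
  </Object_Node>
</Object_Hierarchy>
"""


def node_to_object(xml_text):
    """Forward index: object node id -> referenced multimedia object id."""
    root = ET.fromstring(xml_text)
    return {n.get("id"): n.get("Object_Ref") for n in root.iter("Object_Node")}


def object_to_nodes(node_map):
    """Backward index: multimedia object id -> ids of nodes referencing it."""
    back = {}
    for node_id, obj_id in node_map.items():
        back.setdefault(obj_id, []).append(node_id)
    return back


forward = node_to_object(HIERARCHY_XML)
backward = object_to_nodes(forward)
print(forward)
print(backward)
```

The backward index computed here is the information a description would carry in each multimedia object's Object_Node_Ref attribute, which lets a reader move in either direction without scanning the whole hierarchy.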

Although a hierarchy is adequate for many purposes (e.g., A is the father of B) and is efficient in retrieval, some relationships among multimedia objects cannot be expressed using a hierarchical structure (e.g., A is talking to B). For purposes of this discussion, the multimedia description scheme 210 also allows the specification of more complex relations among multimedia objects using an entity relation graph.

An entity relation graph element (<Entity_Relation_Graph>) includes one or more entity relation elements (<Entity_Relation>). It has two optional attributes: a unique identifier, ID, and a string, type, that describes the binding expressed by the graph.

An entity relation element (<Entity_Relation>) must include one relation element (<Relation>), zero or more entity node elements (<Entity_Node>), zero or more entity node set elements (<Entity_Node_Set>), and zero or more entity relation elements (<Entity_Relation>). An optional attribute, type, can be included in an entity relation element to describe the type of relation it expresses.

Each entity node element references a multimedia object in the multimedia object set through an attribute, Object_Ref, by using the latter's unique identifier. Each entity node element can also include a unique identifier in the form of an attribute ID. By including the entity nodes' unique identifiers in their Entity_Node_Ref attributes, multimedia objects can point back to entity nodes referencing them. For efficient traversal of the multimedia description, this mechanism is provided to traverse from multimedia objects in the multimedia object set to corresponding entity nodes in the ER graph and vice versa.

Hierarchies and entity relation graphs can be used to state spatial, temporal, and semantic relationships among media objects. Examples of such types of relationships are described below. In addition, the use of entity relation graphs is also described in more detail.

<Entity_Relation_Graph type="TEMPORAL.SMIL">
  <Entity_Relation type="TEMPORAL">
    <Relation With_Respect_To="MediaObject1">
      <Temporal_Sequential pattern="DELAY" />
    </Relation>
    <Entity_Node Media_Object_Ref="MediaObject1" />
    <Entity_Node Media_Object_Ref="MediaObject2" Start_Time="2" />
    <Entity_Relation>
      <Relation type="TEMPORAL">
        <Temporal_Parallel />
      </Relation>
      <Entity_Node Media_Object_Ref="MediaObject3" />
      <!-- Optional start or end time -->
      <Entity_Node Media_Object_Ref="MediaObject4" />
    </Entity_Relation>
  </Entity_Relation>
</Entity_Relation_Graph>

<Entity_Relation_Graph type="SPATIAL.2DSTRING">
  <Entity_Relation type="SPATIAL">
    <Relation With_Respect_To="MediaObject1">
      <Spatial_Arrangement At_Time="3">
        <Spatial_Relevance pattern="Upper_Right_Of" />
      </Spatial_Arrangement>
    </Relation>
    <Entity_Node Media_Object_Ref="MediaObject1" />
    <Entity_Node Media_Object_Ref="MediaObject2" />
    <Entity_Node Media_Object_Ref="MediaObject3" />
  </Entity_Relation>
  <Entity_Relation type="SPATIAL">
    <Relation With_Respect_To="MediaObject2">
      <Spatial_Arrangement At_Time="3">
        <Spatial_Relevance pattern="Lower_Left_Of" />
      </Spatial_Arrangement>
    </Relation>
    <Entity_Node Media_Object_Ref="MediaObject3" />
    <Entity_Node Media_Object_Ref="MediaObject1" />
    <Entity_Node Media_Object_Ref="MediaObject2" />
  </Entity_Relation>
</Entity_Relation_Graph>

In the first graph of this example (type TEMPORAL.SMIL), SMIL temporal models are used.
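To illustrate how the spatial relations in such a graph might be read back, the following Python sketch walks the entity relation elements and collects, for each one, the pattern, the reference object, the time, and the participating entity nodes. The sample repeats the first spatial relation of the listing above; the function name spatial_relations and the shape of the returned dictionaries are assumptions of this sketch rather than anything defined in Appendix A.

```python
import xml.etree.ElementTree as ET

# First spatial relation from the example graph above.
GRAPH_XML = """
<Entity_Relation_Graph type="SPATIAL.2DSTRING">
  <Entity_Relation type="SPATIAL">
    <Relation With_Respect_To="MediaObject1">
      <Spatial_Arrangement At_Time="3">
        <Spatial_Relevance pattern="Upper_Right_Of"/>
      </Spatial_Arrangement>
    </Relation>
    <Entity_Node Media_Object_Ref="MediaObject1"/>
    <Entity_Node Media_Object_Ref="MediaObject2"/>
    <Entity_Node Media_Object_Ref="MediaObject3"/>
  </Entity_Relation>
</Entity_Relation_Graph>
"""


def spatial_relations(xml_text):
    """List spatial relations with their reference object and entity nodes."""
    root = ET.fromstring(xml_text)
    results = []
    # iter() also visits entity relations nested inside other relations.
    for entity_relation in root.iter("Entity_Relation"):
        relation = entity_relation.find("Relation")
        if relation is None:
            continue
        arrangement = relation.find("Spatial_Arrangement")
        if arrangement is None:
            continue  # not a spatial relation
        relevance = arrangement.find("Spatial_Relevance")
        if relevance is None:
            continue
        results.append({
            "pattern": relevance.get("pattern"),
            "with_respect_to": relation.get("With_Respect_To"),
            "at_time": arrangement.get("At_Time"),
            "nodes": [n.get("Media_Object_Ref")
                      for n in entity_relation.findall("Entity_Node")],
        })
    return results


relations = spatial_relations(GRAPH_XML)
print(relations)
```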

The content of the relation could be particularized for each different scenario. In Appendix A, temporal, spatial, and semantic relations are included. Similar types of relations could be added as needed. Acceptable relationships for specific applications can be defined in advance. The content of this element states the relation among the entity nodes included in that entity relation element.

Many different types of relations can be declared among multiple objects in the object set; spatial (topological or directional), temporal (topological or directional), and semantic relations are just some examples. An example of an entity relation graph was included above to show how these structures could be used to describe temporal and spatial relationships among media objects. The same example is also valid for multimedia objects (Appendix B).

FIG. 6 is an exemplary flowchart of the multimedia integration description process. The process begins in step 610, where multimedia content is received by the global description generator 115 and one or more of the image description generator 125, video description generator 130, audio description generator 135, synthetic description generator 140 and the text description generator 145. In step 620, the multimedia components are separated and at step 630 the single media event is classified within each of the multimedia categories in the description generators 125-145.

In step 640, the description generators 125-145 generate descriptions from each respective multimedia category which are then forwarded to the multimedia integration description generator 165. In step 650, the multimedia integration description generator 165 puts the descriptions into the proper format using the multimedia integration description scheme provided by the multimedia integration description scheme unit 170.

Then, in step 660, the multimedia integration description generator 165 integrates the multimedia descriptions and in step 670, the integrated descriptions are stored in database 160. The process then ends.
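The flow of steps 620 through 670 can be pictured with the following Python sketch. It is a toy stand-in under stated assumptions: the function names, the tuple-based input format, the dictionary used as the database, and the placeholder "summary" descriptions are all hypothetical and do not reflect how the description generators 125-145 or the integration description generator 165 actually operate.

```python
# Media categories named in the specification.
CATEGORIES = ("image", "video", "audio", "synthetic", "text")


def separate(content):
    """Step 620 (sketch): group (category, payload) pairs by category."""
    groups = {category: [] for category in CATEGORIES}
    for category, payload in content:
        if category not in groups:
            raise ValueError("unknown media category: " + category)
        groups[category].append(payload)
    return groups


def describe(groups):
    """Steps 630-640 (sketch): one placeholder description per component."""
    return {
        category: [{"media": category, "summary": str(p)} for p in payloads]
        for category, payloads in groups.items()
        if payloads
    }


def integrate(stream_id, descriptions):
    """Steps 650-660 (sketch): merge per-media descriptions into one record."""
    merged = [d for c in CATEGORIES for d in descriptions.get(c, [])]
    return {"id": stream_id, "objects": merged}


database = {}  # hypothetical stand-in for database 160


def store(record):
    """Step 670 (sketch): keep the integrated description by stream id."""
    database[record["id"]] = record


content = [("video", "news clip"), ("audio", "anchor speech"),
           ("text", "closed captions")]
record = integrate("mmstream1", describe(separate(content)))
store(record)
print(record)
```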

While the invention has been described with reference to the embodiments, it is to be understood that the invention is not restricted to the particular forms shown in the foregoing embodiments. Various modifications and alterations can be made thereto without departing from the scope of the invention.

We claim:
 1. A method comprising: receiving a multimedia stream having media objects; categorizing the media objects into a global media object and a local media object, wherein the global media object represents the multimedia stream and is categorized by a first attribute type and the local media object represents a limited feature within the multimedia stream and is categorized by a second attribute type; receiving the global media object at a global media description generation unit; receiving the local media object at a local media description generation unit; generating, at the global media description generation unit, a global multimedia object description for the global media object; generating, at the local media description generation unit, a local multimedia object description for the local media object; generating an integration description scheme which creates, when implemented, relationships between the global multimedia object description and the local multimedia object description; generating a description record to represent a portion of the multimedia stream by integrating, according to the integration description scheme, the global multimedia object description and the local multimedia object description; and retrieving requested multimedia content based on a search query similarity to the description record.
 2. The method of claim 1, wherein the global multimedia object description and the local multimedia object description further comprise an object hierarchy element and an entity relation graph element.
 3. The method of claim 2, wherein the object hierarchy element and the entity relation graph element describe one of spatial, temporal, and semantic relationships between pieces of content embedded within the multimedia stream.
 4. The method of claim 1, wherein the global multimedia object description and the local multimedia object description contain an optional attribute, wherein the optional attribute comprises one of a unique identifier, a reference to another multimedia object description, lists of identifiers of object node elements, and lists of entity node elements that reference the multimedia objects.
 5. The method of claim 1, wherein the global multimedia object description and the local multimedia object description further comprise corresponding descriptor elements.
 6. The method of claim 5, wherein the corresponding descriptor elements comprise one of links to external extraction, similarity matching code, and external definitions for descriptions.
 7. A system comprising: a processor; and a computer-readable storage device having instructions stored which, when executed by the processor, cause the processor to perform operations comprising: receiving a multimedia stream having media objects; categorizing the media objects into a global media object and a local media object, wherein the global media object represents the multimedia stream and is categorized by a first attribute type and the local media object represents a limited feature within the multimedia stream and is categorized by a second attribute type; receiving the global media object at a global media description generation unit; receiving the local media object at a local media description generation unit; generating, at the global media description generation unit, a global multimedia object description for the global media object; generating, at the local media description generation unit, a local multimedia object description for the local media object; generating an integration description scheme which creates, when implemented, relationships between the global multimedia object description and the local multimedia object description; generating a description record to represent a portion of the multimedia stream by integrating, according to the integration description scheme, the global multimedia object description and the local multimedia object description; and retrieving requested multimedia content based on a search query similarity to the description record.
 8. The system of claim 7, wherein the global multimedia object description and the local multimedia object description further comprise an object hierarchy element and an entity relation graph element.
 9. The system of claim 8, wherein the object hierarchy element and the entity relation graph element describe one of spatial, temporal, and semantic relationships between pieces of content embedded within the multimedia stream.
 10. The system of claim 7, wherein the global multimedia object description and the local multimedia object description contain an optional attribute, wherein the optional attribute comprises one of a unique identifier, a reference to another multimedia object description, lists of identifiers of object node elements, and lists of entity node elements that reference the multimedia objects.
 11. The system of claim 7, wherein the global multimedia object description and the local multimedia object description further comprise corresponding descriptor elements.
 12. The system of claim 11, wherein the corresponding descriptor elements comprise one of links to external extraction, similarity matching code, and external definitions for descriptions.
 13. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising: receiving a multimedia stream having media objects; categorizing the media objects into a global media object and a local media object, wherein the global media object represents the multimedia stream and is categorized by a first attribute type and the local media object represents a limited feature within the multimedia stream and is categorized by a second attribute type; receiving the global media object at a global media description generation unit; receiving the local media object at a local media description generation unit; generating, at the global media description generation unit, a global multimedia object description for the global media object; generating, at the local media description generation unit, a local multimedia object description for the local media object; generating an integration description scheme which creates, when implemented, relationships between the global multimedia object description and the local multimedia object description; generating a description record to represent a portion of the multimedia stream by integrating, according to the integration description scheme, the global multimedia object description and the local multimedia object description; and retrieving requested multimedia content based on a search query similarity to the description record.
 14. The computer-readable storage device of claim 13, wherein the global multimedia object description and the local multimedia object description further comprise an object hierarchy element and an entity relation graph element.
 15. The computer-readable storage device of claim 14, wherein the object hierarchy element and the entity relation graph element describe one of spatial, temporal, and semantic relationships between pieces of content embedded within the multimedia stream.
 16. The computer-readable storage device of claim 13, wherein the global multimedia object description and the local multimedia object description contain an optional attribute, wherein the optional attribute comprises one of a unique identifier, a reference to another multimedia object description, lists of identifiers of object node elements, and lists of entity node elements that reference the multimedia objects.
 17. The computer-readable storage device of claim 13, wherein the global multimedia object description and the local multimedia object description further comprise corresponding descriptor elements.